impl(o11y): introduce error attributes #12189

Merged
diegomarquezp merged 45 commits into main from observability/tracing-attr-error-type-transfer
Apr 1, 2026

Contributor

@diegomarquezp diegomarquezp commented Mar 24, 2026

This PR implements error type recording on attempt spans for improved observability in gax-java, addressing requirements for better failure analysis.

Key Changes

  • ErrorTypeUtil Class: This new utility class classifies errors based on a defined priority to populate observability attributes.

Error Classification Priority

The extraction logic determines the error type based on the following priority:

  1. google.rpc.ErrorInfo.reason: If the error response from the service includes ErrorInfo details, the reason field (e.g., RATE_LIMIT_EXCEEDED) is used.
  2. Server Status Code: If no reason is available, it checks for a server status code. For HTTP, this is the numeric status code (e.g., 403, 503). For gRPC, this is the status code name (e.g., PERMISSION_DENIED, UNAVAILABLE).
  3. Client-Side Network/Operational Errors: If it's a client-side failure, it maps common exceptions to specific enum representations (e.g., CLIENT_TIMEOUT, CLIENT_CONNECTION_ERROR).
  4. Language-specific error type: Falls back to the class simple name of the exception (e.g., NullPointerException).
  5. Internal Fallback: Defaults to INTERNAL if no other classification applies.
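The priority order above can be sketched as a chain of early returns. This is a minimal illustration, not the `ErrorTypeUtil` added in this PR: the `ErrorTypeSketch` and `Failure` names are hypothetical, the inputs are plain strings instead of gax or gRPC types, and the choice of which JDK exceptions map to `CLIENT_TIMEOUT` and `CLIENT_CONNECTION_ERROR` is an assumption.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for illustration; not the ErrorTypeUtil from this PR. */
final class ErrorTypeSketch {

  /** Hypothetical carrier for what is known about a failed attempt. */
  static final class Failure {
    final String errorInfoReason; // google.rpc.ErrorInfo.reason, or null if absent
    final String serverStatus; // e.g. "403" for HTTP or "UNAVAILABLE" for gRPC, or null
    final Throwable cause; // the underlying exception, or null

    Failure(String errorInfoReason, String serverStatus, Throwable cause) {
      this.errorInfoReason = errorInfoReason;
      this.serverStatus = serverStatus;
      this.cause = cause;
    }
  }

  /** Walks the five priorities in order and returns the first match. */
  static String classify(Failure f) {
    // 1. ErrorInfo.reason wins when the service supplied one.
    if (f.errorInfoReason != null && !f.errorInfoReason.isEmpty()) {
      return f.errorInfoReason;
    }
    // 2. Server status code (numeric for HTTP, code name for gRPC).
    if (f.serverStatus != null && !f.serverStatus.isEmpty()) {
      return f.serverStatus;
    }
    // 3. Client-side network/operational errors (this mapping is assumed).
    if (f.cause instanceof SocketTimeoutException || f.cause instanceof TimeoutException) {
      return "CLIENT_TIMEOUT";
    }
    if (f.cause instanceof ConnectException) {
      return "CLIENT_CONNECTION_ERROR";
    }
    // 4. Language-specific type: the exception's simple class name.
    if (f.cause != null) {
      String simpleName = f.cause.getClass().getSimpleName();
      if (!simpleName.isEmpty()) { // anonymous classes have an empty simple name
        return simpleName;
      }
    }
    // 5. Nothing else applied.
    return "INTERNAL";
  }

  private ErrorTypeSketch() {}
}
```

Because each step returns early, a failure that carries both an `ErrorInfo` reason and a status code is reported by its reason, which keeps the attribute as specific as the server allows.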

Exceptions to be Unwrapped

We investigated standard execution wrappers to ensure accurate error classification in ErrorTypeUtil. We only found one exception so far that needs unwrapping in this context.

UncheckedExecutionException

Occurs in ServerStreamIterator.java when wrapping checked exceptions observed during stream iteration.

    if (last instanceof Throwable) {
      Throwable throwable = (Throwable) last;
      throw new UncheckedExecutionException(throwable);
    }

It is also thrown in ApiExceptions.java during synchronous call translation:

  public static <ResponseT> ResponseT callAndTranslateApiException(ApiFuture<ResponseT> future) {
    try {
      return Futures.getUnchecked(future);
    } catch (UncheckedExecutionException exception) {
      if (exception.getCause() instanceof RuntimeException) {
        // ...
      }
      throw exception;
    }
  }
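A minimal sketch of the unwrapping step, under stated assumptions: `UnwrapSketch` and `WrapperException` are hypothetical names, and `WrapperException` is only a local stand-in for Guava's `UncheckedExecutionException` so that the example compiles without Guava on the classpath. The sketch removes a single wrapper layer; whether the real utility unwraps further is not stated in this PR.

```java
/** Hypothetical illustration of unwrapping before classification. */
final class UnwrapSketch {

  /** Local stand-in for Guava's UncheckedExecutionException (assumption for this sketch). */
  static final class WrapperException extends RuntimeException {
    WrapperException(Throwable cause) {
      super(cause);
    }
  }

  /**
   * Returns the cause when {@code t} is the known wrapper type and has a cause;
   * otherwise returns {@code t} unchanged.
   */
  static Throwable unwrap(Throwable t) {
    if (t == null || t.getCause() == null) {
      return t; // nothing to unwrap
    }
    if (t instanceof WrapperException) {
      return t.getCause(); // classify the wrapped failure, not the wrapper
    }
    return t; // other exception types keep their own identity
  }

  private UnwrapSketch() {}
}
```

Classifying the unwrapped cause keeps the wrapper's own class name from being recorded as the error type in place of the underlying failure.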

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refines the tracing telemetry by introducing dedicated attributes for error types, exception types, and status messages. These additions provide a more detailed and standardized way to categorize and understand failures occurring within client-side operations, improving observability and debugging capabilities. The changes ensure that critical error information is consistently captured in OpenTelemetry spans, offering clearer insights into the root causes of issues.

Highlights

  • Enhanced Error Telemetry: Introduced new OpenTelemetry attributes (error.type, status.message, exception.type) to provide more granular details on client-side errors within tracing spans.
  • Standardized Error Type Extraction: Added a new utility class ErrorTypeUtil to consistently extract low-cardinality error types from Throwable objects, prioritizing google.rpc.ErrorInfo.reason, client-side network/operational errors, specific server codes, and language-specific exception names.
  • Span Attribute Population: Modified SpanTracer to automatically populate the new error telemetry attributes on attempt spans when operations fail, including a recursive search for the most relevant error message.
  • API Tracer Update: Extended the ApiTracer interface with a new default method requestSent(long requestSize) to allow for tracking the size of streaming requests.
  • Comprehensive Testing: Added extensive unit and integration tests to validate the correct extraction and recording of error types and messages across various client-side and server-side failure scenarios for both gRPC and HTTP/JSON transports.
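To make the span attribute highlight concrete, here is a rough sketch of assembling the three attributes for a failed attempt. It is not the `SpanTracer` code from this PR: `SpanErrorAttributesSketch` is a hypothetical name, a plain `Map` stands in for an OpenTelemetry span, and the sources chosen for `exception.type` (the fully qualified class name) and `status.message` (the exception message) are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration; a Map stands in for an OpenTelemetry span. */
final class SpanErrorAttributesSketch {

  /**
   * Builds the error attributes for one failed attempt.
   *
   * @param errorType low-cardinality classification, e.g. "UNAVAILABLE"
   * @param error the failure observed on the attempt, or null
   */
  static Map<String, String> errorAttributes(String errorType, Throwable error) {
    Map<String, String> attributes = new LinkedHashMap<>();
    attributes.put("error.type", errorType);
    if (error != null) {
      // Assumption: exception.type carries the fully qualified class name.
      attributes.put("exception.type", error.getClass().getName());
      // Assumption: status.message carries the exception message when present.
      if (error.getMessage() != null) {
        attributes.put("status.message", error.getMessage());
      }
    }
    return attributes;
  }

  private SpanErrorAttributesSketch() {}
}
```

Keeping `error.type` low-cardinality while the free-form text goes into `status.message` lets the former be used for grouping and the latter for debugging a single trace.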

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new ErrorTypeUtil class to provide a standardized way of classifying exceptions into specific error types (e.g., client timeout, connection error, authentication error) for OpenTelemetry tracing. It adds new observability attributes (error.type, exception.type, status.message) and integrates this error classification into the SpanTracer to enrich span data upon failed attempts. The changes also include comprehensive unit and integration tests to validate the new error type extraction and tracing functionality. Feedback from the review includes correcting incorrect copyright years, removing redundant semicolons and toString() overrides, and updating Javadoc for accuracy.

@diegomarquezp diegomarquezp changed the title from "feat: Refine tracing telemetry for client-side attributes" to "impl(o11y): introduce error attributes" Mar 24, 2026
@diegomarquezp diegomarquezp marked this pull request as ready for review March 26, 2026 18:48
@diegomarquezp diegomarquezp requested a review from a team as a code owner March 26, 2026 18:48
@diegomarquezp diegomarquezp marked this pull request as draft March 26, 2026 20:32
@diegomarquezp
Contributor Author

01:03:56:592 [ERROR] Failed to execute goal org.graalvm.buildtools:native-maven-plugin:0.10.6:test (test-native) on project google-auth-library-credentials: Execution test-native of goal org.graalvm.buildtools:native-maven-plugin:0.10.6:test failed: Test configuration file wasn't found. -> [Help 1]
01:03:56:593 [ERROR] Failed to execute goal org.graalvm.buildtools:native-maven-plugin:0.10.6:test (test-native) on project api-common: Execution test-native of goal org.graalvm.buildtools:native-maven-plugin:0.10.6:test failed: Test configuration file wasn't found. -> [Help 1]
01:03:56:593 [ERROR] Failed to execute goal org.graalvm.buildtools:native-maven-plugin:0.10.6:test (test-native) on project google-auth-library-appengine: Execution test-native of goal org.graalvm.buildtools:native-maven-plugin:0.10.6:test failed: Test configuration file wasn't found. -> [Help 1]
01:03:56:593 [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.5.2:test (default-test) on project google-auth-library-cab-token-generator: 

The GraalVM failures seem unrelated to this PR.

@diegomarquezp diegomarquezp marked this pull request as ready for review March 27, 2026 15:43
Contributor Author

@diegomarquezp diegomarquezp left a comment

Comments addressed


public class ErrorTypeUtil {

  public enum ErrorType {
Contributor Author

Done


private String extractErrorMessage(Throwable error) {
  Throwable cause = error;
  while (cause != null) {
Contributor Author

Thanks for the catch. Done.

endAttempt();
}

private String extractErrorMessage(Throwable error) {
Contributor

Is this method unused?

Contributor Author

Yes, it was there before a refactor. Thanks for the catch. Removing.

Status.newBuilder().setCode(com.google.rpc.Code.UNAVAILABLE.ordinal()).build())
.build();

assertThrows(UnavailableException.class, () -> client.echo(echoRequest));
Contributor

Is this from the interceptor or from the error in EchoRequest? I see that UNAVAILABLE is set in both places.

Contributor Author

setCode(com.google.rpc.Code.UNAVAILABLE.ordinal()) should not be here. We should only rely on the interceptor.

Contributor Author

Removed

if (t.getCause() == null) {
  return t;
}
if (t instanceof ExecutionException || t instanceof UncheckedExecutionException) {
Contributor

I guess there is a concrete use case for this scenario?

Contributor Author

Confirmed that only one exception needs to be unwrapped, and added documentation in both the Javadoc and this PR's description.

@diegomarquezp diegomarquezp enabled auto-merge (squash) April 1, 2026 19:40
@diegomarquezp diegomarquezp merged commit d2edbfb into main Apr 1, 2026
117 of 118 checks passed
@diegomarquezp diegomarquezp deleted the observability/tracing-attr-error-type-transfer branch April 1, 2026 20:14
lqiu96 pushed a commit that referenced this pull request Apr 1, 2026
3 participants